Cohen’s Kappa (cohen_kappa_score)#
Cohen’s kappa (\(\kappa\)) measures agreement beyond chance between two labelers. In ML, it’s commonly used to evaluate a classifier against ground truth, especially under class imbalance.
\(p_o\) = observed agreement (diagonal of the confusion matrix; for classifier-vs-truth it equals accuracy)
\(p_e\) = expected agreement under independent marginals (chance agreement given the label prevalences)
Range: \(\kappa\in[-1, 1]\) (1 = perfect agreement, 0 = chance level, negative = worse than chance; the exact attainable minimum depends on the marginals).
Goals#
Build intuition for why \(\kappa\) differs from accuracy
Derive \(\kappa\) from the confusion matrix (binary + multiclass)
Implement \(\kappa\) from scratch in NumPy (including weighted \(\kappa\) for ordinal labels)
Use \(\kappa\) in a simple optimization loop (threshold selection for a logistic regression model)
Prerequisites#
Confusion matrix basics
Basic probability (marginals, independence)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Why \(\kappa\) (and not just accuracy)?#
Accuracy counts raw agreement. If the data are imbalanced, a trivial classifier can look “good” by mostly predicting the majority class.
\(\kappa\) adjusts for the agreement you would expect by chance, given the label frequencies.
# A trivial “majority class” classifier
y_true = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]
y_pred = np.zeros_like(y_true)
accuracy = (y_true == y_pred).mean()
kappa = cohen_kappa_score(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}")
print(f"kappa = {kappa:.3f}")
accuracy = 0.950
kappa = 0.000
2) The math (confusion matrix → \(\kappa\))#
Let there be \(K\) classes and a confusion matrix \(C\in\mathbb{N}^{K\times K}\) where \(C_{ij}\) counts samples with true class \(i\) predicted as class \(j\). Let \(N = \sum_{i,j} C_{ij}\).
Observed agreement (diagonal mass): $\( p_o = \frac{1}{N}\sum_{i=1}^K C_{ii} \)$
Define the marginal class frequencies (as proportions): $\( r_i = \frac{1}{N}\sum_{j=1}^K C_{ij} \quad (\text{true prevalence}), \qquad c_i = \frac{1}{N}\sum_{j=1}^K C_{ji} \quad (\text{predicted prevalence}). \)$
If true and predicted labels were independent but kept the same marginals, the expected agreement is: $\( p_e = \sum_{i=1}^K r_i\,c_i \)$
Finally: $\( \kappa = \frac{p_o - p_e}{1 - p_e} \)$
Notes:
\(\kappa=1\) when \(p_o=1\).
\(\kappa=0\) when \(p_o=p_e\) (no better than chance under the marginals).
\(\kappa<0\) when agreement is worse than chance.
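To make the formulas concrete before implementing them, here is a hand-sized binary example (the counts are invented to give round numbers):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# C[i, j] = count(true=i, pred=j)
C = np.array([[40, 10],
              [20, 30]])
N = C.sum()                       # 100
p_o = np.trace(C) / N             # (40 + 30) / 100 = 0.70
r = C.sum(axis=1) / N             # true prevalences: (0.5, 0.5)
c = C.sum(axis=0) / N             # predicted prevalences: (0.6, 0.4)
p_e = float(r @ c)                # 0.5*0.6 + 0.5*0.4 = 0.50
kappa = (p_o - p_e) / (1 - p_e)   # (0.70 - 0.50) / 0.50 = 0.40

# Cross-check: expand the matrix into label pairs and ask scikit-learn
y_true = np.repeat([0, 0, 1, 1], C.ravel())
y_pred = np.repeat([0, 1, 0, 1], C.ravel())
assert np.isclose(cohen_kappa_score(y_true, y_pred), kappa)
print(f"p_o={p_o:.2f}  p_e={p_e:.2f}  kappa={kappa:.2f}")
```

Note that 70% raw agreement shrinks to \(\kappa = 0.4\) once the 50% chance agreement implied by these marginals is subtracted.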
def confusion_matrix_numpy(
y_true: np.ndarray,
y_pred: np.ndarray,
labels: np.ndarray | None = None,
) -> tuple[np.ndarray, np.ndarray]:
"""Confusion matrix C where C[i, j] counts true=labels[i], pred=labels[j]."""
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError("y_true and y_pred must have the same shape.")
if labels is None:
labels = np.unique(np.concatenate([y_true, y_pred]))
labels = np.asarray(labels)
label_to_index = {label: i for i, label in enumerate(labels.tolist())}
true_idx = np.array([label_to_index[label] for label in y_true], dtype=int)
pred_idx = np.array([label_to_index[label] for label in y_pred], dtype=int)
k = labels.size
cm = np.zeros((k, k), dtype=int)
np.add.at(cm, (true_idx, pred_idx), 1)
return cm, labels
def kappa_components_from_cm(cm: np.ndarray) -> tuple[float, float]:
"""Return (p_o, p_e) computed from a confusion matrix."""
cm = np.asarray(cm)
n = cm.sum()
if n == 0:
raise ValueError("Confusion matrix has zero total count.")
p_o = float(np.trace(cm) / n)
row = cm.sum(axis=1) / n
col = cm.sum(axis=0) / n
p_e = float(row @ col)
return p_o, p_e
def kappa_weight_matrix(n_classes: int, weights: str | np.ndarray | None) -> np.ndarray:
"""Disagreement weights W (0 on diagonal)."""
if n_classes < 1:
raise ValueError("n_classes must be >= 1")
if weights is None:
w = np.ones((n_classes, n_classes), dtype=float)
np.fill_diagonal(w, 0.0)
return w
if isinstance(weights, str):
if weights not in {"linear", "quadratic"}:
raise ValueError("weights must be None, 'linear', 'quadratic', or a (K,K) array.")
if n_classes == 1:
return np.zeros((1, 1), dtype=float)
idx = np.arange(n_classes)
i, j = np.meshgrid(idx, idx, indexing="ij")
if weights == "linear":
return (np.abs(i - j) / (n_classes - 1)).astype(float)
return (((i - j) ** 2) / ((n_classes - 1) ** 2)).astype(float)
w = np.asarray(weights, dtype=float)
if w.shape != (n_classes, n_classes):
raise ValueError(f"custom weights must have shape ({n_classes},{n_classes})")
return w
def cohen_kappa_score_numpy(
y_true: np.ndarray,
y_pred: np.ndarray,
labels: np.ndarray | None = None,
weights: str | np.ndarray | None = None,
) -> float:
"""Cohen's kappa from scratch (unweighted or weighted)."""
cm, labels = confusion_matrix_numpy(y_true, y_pred, labels=labels)
n = cm.sum()
if n == 0:
raise ValueError("No samples.")
o = cm / n
row = o.sum(axis=1)
col = o.sum(axis=0)
e = np.outer(row, col)
w = kappa_weight_matrix(labels.size, weights)
denom = float((w * e).sum())
num = float((w * o).sum())
# Degenerate case: expected disagreement is zero (perfectly concentrated marginals)
if np.isclose(denom, 0.0):
return 1.0
return 1.0 - (num / denom)
# Sanity check: match scikit-learn
y_true = rng.integers(0, 4, size=1_000)
y_pred = rng.integers(0, 4, size=1_000)
k_np = cohen_kappa_score_numpy(y_true, y_pred)
k_sk = cohen_kappa_score(y_true, y_pred)
print("unweighted", k_np, k_sk)
assert np.isclose(k_np, k_sk)
k_np_lin = cohen_kappa_score_numpy(y_true, y_pred, weights="linear")
k_sk_lin = cohen_kappa_score(y_true, y_pred, weights="linear")
print("linear ", k_np_lin, k_sk_lin)
assert np.isclose(k_np_lin, k_sk_lin)
k_np_quad = cohen_kappa_score_numpy(y_true, y_pred, weights="quadratic")
k_sk_quad = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print("quadratic", k_np_quad, k_sk_quad)
assert np.isclose(k_np_quad, k_sk_quad)
unweighted -0.006566010065659933 -0.006566010065659933
linear -0.02898015572377166 -0.02898015572377166
quadratic -0.05021724810620465 -0.050217248106204426
3) Visual intuition: observed vs expected agreement#
\(\kappa\) depends on both:
how much mass is on the diagonal (observed agreement)
how concentrated the marginals are (chance agreement)
Let’s look at a small multiclass example.
def plot_confusion_matrix(cm: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
fig = px.imshow(
cm,
x=[str(l) for l in labels],
y=[str(l) for l in labels],
text_auto=True,
color_continuous_scale="Blues",
title=title,
labels={"x": "pred", "y": "true", "color": "count"},
)
fig.update_yaxes(autorange="reversed")
fig.update_layout(coloraxis_showscale=False)
return fig
def plot_true_vs_pred_marginals(cm: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
n = cm.sum()
true_marg = cm.sum(axis=1) / n
pred_marg = cm.sum(axis=0) / n
fig = go.Figure()
fig.add_trace(go.Bar(x=[str(l) for l in labels], y=true_marg, name="true", opacity=0.8))
fig.add_trace(go.Bar(x=[str(l) for l in labels], y=pred_marg, name="pred", opacity=0.8))
fig.update_layout(
barmode="group",
title=title,
xaxis_title="class",
yaxis_title="proportion",
)
fig.update_yaxes(range=[0, 1])
return fig
labels = np.array([0, 1, 2])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(150, dtype=int), np.full(50, 2, dtype=int)]
y_pred = y_true.copy()
flip_idx = rng.choice(y_true.size, size=80, replace=False)
y_pred[flip_idx] = (y_true[flip_idx] + rng.integers(1, 3, size=flip_idx.size)) % 3
cm, labels_used = confusion_matrix_numpy(y_true, y_pred, labels=labels)
p_o, p_e = kappa_components_from_cm(cm)
kappa = cohen_kappa_score_numpy(y_true, y_pred, labels=labels)
print(f"p_o (observed) = {p_o:.3f}")
print(f"p_e (expected) = {p_e:.3f}")
print(f"kappa = {kappa:.3f}")
plot_confusion_matrix(cm, labels_used, title="Confusion matrix (example)")
p_o (observed) = 0.800
p_e (expected) = 0.390
kappa = 0.672
plot_true_vs_pred_marginals(cm, labels_used, title="True vs predicted marginals")
4) How prevalence changes \(\kappa\)#
Even if two classifiers have the same accuracy, \(\kappa\) can differ depending on class prevalence because \(p_e\) changes with the marginals.
Below we simulate a binary classifier with a fixed error rate: we flip the true label with probability \(q\). Accuracy stays at \(1-q\), but \(\kappa\) changes as the positive class prevalence moves toward 0 or 1.
def simulate_symmetric_noise(prevalences: np.ndarray, q: float, n: int = 50_000) -> tuple[np.ndarray, np.ndarray]:
accs: list[float] = []
kappas: list[float] = []
for pi in prevalences:
y = (rng.random(n) < pi).astype(int)
flip = rng.random(n) < q
y_hat = y.copy()
y_hat[flip] = 1 - y_hat[flip]
accs.append(float((y == y_hat).mean()))
kappas.append(cohen_kappa_score_numpy(y, y_hat))
return np.array(accs), np.array(kappas)
prevalences = np.linspace(0.02, 0.98, 60)
q = 0.10
accs, kappas = simulate_symmetric_noise(prevalences, q=q)
fig = go.Figure()
fig.add_trace(go.Scatter(x=prevalences, y=accs, mode="lines", name="accuracy"))
fig.add_trace(go.Scatter(x=prevalences, y=kappas, mode="lines", name="kappa"))
fig.update_layout(
title=f"Fixed error rate q={q:.2f}: accuracy vs kappa across prevalence",
xaxis_title="positive class prevalence π = P(y=1)",
yaxis_title="metric value",
)
fig.update_yaxes(range=[-0.1, 1.0])
fig
# Majority-class predictor: accuracy can be high, but kappa stays ~0
prevalences = np.linspace(0.02, 0.98, 60)
n = 50_000
accs: list[float] = []
kappas: list[float] = []
for pi in prevalences:
y = (rng.random(n) < pi).astype(int)
y_hat = np.zeros_like(y)
accs.append(float((y == y_hat).mean()))
kappas.append(cohen_kappa_score_numpy(y, y_hat))
fig = go.Figure()
fig.add_trace(go.Scatter(x=prevalences, y=accs, mode="lines", name="accuracy"))
fig.add_trace(go.Scatter(x=prevalences, y=kappas, mode="lines", name="kappa"))
fig.update_layout(
title="Always-predict-0 baseline across prevalence",
xaxis_title="positive class prevalence π = P(y=1)",
yaxis_title="metric value",
)
fig.update_yaxes(range=[-0.1, 1.0])
fig
5) Weighted \(\kappa\) (ordinal labels)#
If classes are ordered (e.g., 1–5 star ratings), being off by 1 is usually less severe than being off by 4. Weighted kappa replaces the all-or-nothing agreement criterion with a graded disagreement penalty.
Let \(O_{ij}=C_{ij}/N\) be the observed joint distribution and \(E_{ij}=r_i c_j\) the expected joint distribution. Choose disagreement weights \(w_{ij}\) with \(w_{ii}=0\). Then: $\( \kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\,O_{ij}}{\sum_{i,j} w_{ij}\,E_{ij}} \)$ With \(w_{ij}=1\) for all \(i\neq j\), this reduces to the unweighted \(\kappa\).
Common choices:
linear: \(w_{ij}=\frac{|i-j|}{K-1}\)
quadratic: \(w_{ij}=\frac{(i-j)^2}{(K-1)^2}\)
These only make sense when the class order is meaningful.
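A quick sketch of the effect (the noise model and error placement are arbitrary choices for illustration): two predictors that make mistakes on exactly the same items, one always off by 1 and one jumping toward the opposite end of the scale, get similar unweighted \(\kappa\) but very different quadratic \(\kappa\).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=2_000)          # 5 ordinal classes, roughly uniform
err = rng.random(y.size) < 0.3              # same error positions for both predictors

near = y.copy()
near[err] = y[err] + np.where(y[err] < 4, 1, -1)   # always off by exactly 1
far = y.copy()
far[err] = np.where(y[err] < 2, 4, 0)              # jump toward the opposite end

for name, y_hat in [("near", near), ("far", far)]:
    k_u = cohen_kappa_score(y, y_hat)
    k_q = cohen_kappa_score(y, y_hat, weights="quadratic")
    print(f"{name}: unweighted={k_u:.3f}  quadratic={k_q:.3f}")
```

Both predictors have identical \(p_o\), so their unweighted scores nearly coincide; quadratic weighting separates them sharply because it scales each error by its squared distance.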
def plot_weight_matrix(w: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
fig = px.imshow(
w,
x=[str(l) for l in labels],
y=[str(l) for l in labels],
text_auto=".2f",
color_continuous_scale="Reds",
title=title,
labels={"x": "pred", "y": "true", "color": "disagreement"},
)
fig.update_yaxes(autorange="reversed")
return fig
k = 5
labels = np.arange(k)
y_true = rng.integers(0, k, size=2_000)
noise = rng.choice([-2, -1, 0, 1, 2], size=y_true.size, p=[0.05, 0.15, 0.60, 0.15, 0.05])
y_pred = np.clip(y_true + noise, 0, k - 1)
k_unw = cohen_kappa_score_numpy(y_true, y_pred)
k_lin = cohen_kappa_score_numpy(y_true, y_pred, weights="linear")
k_quad = cohen_kappa_score_numpy(y_true, y_pred, weights="quadratic")
print(f"unweighted kappa = {k_unw:.3f}")
print(f"linear kappa = {k_lin:.3f}")
print(f"quadratic kappa = {k_quad:.3f}")
cm, labels_used = confusion_matrix_numpy(y_true, y_pred, labels=labels)
plot_confusion_matrix(cm, labels_used, title="Ordinal example: confusion matrix")
unweighted kappa = 0.601
linear kappa = 0.767
quadratic kappa = 0.879
w_lin = kappa_weight_matrix(k, "linear")
w_quad = kappa_weight_matrix(k, "quadratic")
fig = make_subplots(rows=1, cols=2, subplot_titles=["Linear weights", "Quadratic weights"])
fig.add_trace(
go.Heatmap(z=w_lin, x=labels.astype(str), y=labels.astype(str), colorscale="Reds", showscale=False),
row=1,
col=1,
)
fig.add_trace(
go.Heatmap(z=w_quad, x=labels.astype(str), y=labels.astype(str), colorscale="Reds", showscale=False),
row=1,
col=2,
)
fig.update_yaxes(autorange="reversed")
fig.update_layout(title="Disagreement weights (larger = more severe)")
fig
6) Using \(\kappa\) for optimization (threshold tuning)#
\(\kappa\) is computed from discrete predictions, so it is not differentiable with respect to model parameters. That means you usually can't train a model by running gradient descent on \(\kappa\) directly.
A common pattern is:
Train a probabilistic model with a differentiable objective (e.g., log loss)
Pick a decision threshold on a validation set to maximize \(\kappa\)
Below: logistic regression trained from scratch with gradient descent, then a threshold sweep to maximize \(\kappa\).
def sigmoid(z: np.ndarray) -> np.ndarray:
z = np.asarray(z)
out = np.empty_like(z, dtype=float)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
exp_z = np.exp(z[~pos])
out[~pos] = exp_z / (1.0 + exp_z)
return out
def standardize_fit(X: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
mean = X.mean(axis=0)
std = X.std(axis=0)
std = np.where(std == 0.0, 1.0, std)
return (X - mean) / std, mean, std
def standardize_transform(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
return (X - mean) / std
def add_intercept(X: np.ndarray) -> np.ndarray:
return np.c_[np.ones((X.shape[0], 1)), X]
def log_loss_from_logits(y: np.ndarray, logits: np.ndarray) -> float:
# Stable binary cross-entropy: mean(log(1+exp(z)) - y*z)
return float(np.mean(np.logaddexp(0.0, logits) - y * logits))
def fit_logistic_regression_gd(
X: np.ndarray,
y: np.ndarray,
lr: float = 0.2,
n_iters: int = 2_000,
l2: float = 0.0,
record_every: int = 20,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""Return (weights, steps, losses). L2 excludes the intercept term."""
X_i = add_intercept(X)
y = y.astype(float)
w = np.zeros(X_i.shape[1], dtype=float)
steps: list[int] = []
losses: list[float] = []
for step in range(1, n_iters + 1):
logits = X_i @ w
p = sigmoid(logits)
grad = (X_i.T @ (p - y)) / y.size
grad[1:] += l2 * w[1:]
w -= lr * grad
if step % record_every == 0 or step == 1:
steps.append(step)
losses.append(log_loss_from_logits(y, logits) + 0.5 * l2 * float(w[1:] @ w[1:]))
return w, np.array(steps), np.array(losses)
def predict_proba_logreg(X: np.ndarray, w: np.ndarray) -> np.ndarray:
X_i = add_intercept(X)
return sigmoid(X_i @ w)
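As a quick sanity check, the stable identity used in `log_loss_from_logits` (\(\log(1+e^z) - yz\)) can be compared against the naive cross-entropy \(-[y\log p + (1-y)\log(1-p)]\) on moderate logits, where both are safe to evaluate (the test values are arbitrary):

```python
import numpy as np

z = np.array([-3.0, -0.5, 0.0, 1.5, 4.0])   # logits small enough for the naive form
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
p = 1.0 / (1.0 + np.exp(-z))

naive = float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())
stable = float(np.mean(np.logaddexp(0.0, z) - y * z))
print(naive, stable)  # agree up to floating-point error
```

For \(y=1\), \(\log(1+e^z)-z = \log(1+e^{-z}) = -\log\sigma(z)\); for \(y=0\), \(\log(1+e^z) = -\log(1-\sigma(z))\), so the two forms are algebraically identical, but `np.logaddexp` avoids overflow for large \(|z|\).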
# Synthetic, slightly imbalanced binary classification problem
n0, n1 = 2_000, 500
mean0 = np.array([-1.2, -0.2])
mean1 = np.array([1.0, 0.9])
cov = np.array([[1.0, 0.45], [0.45, 1.0]])
X0 = rng.multivariate_normal(mean0, cov, size=n0)
X1 = rng.multivariate_normal(mean1, cov, size=n1)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n0, dtype=int), np.ones(n1, dtype=int)]
X_train, X_tmp, y_train, y_tmp = train_test_split(
X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)
# Standardize based on train only
X_train_s, mean, std = standardize_fit(X_train)
X_val_s = standardize_transform(X_val, mean, std)
X_test_s = standardize_transform(X_test, mean, std)
w, steps, losses = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, n_iters=2_000, l2=0.1)
fig = px.line(x=steps, y=losses, title="Logistic regression (GD): training objective")
fig.update_layout(xaxis_title="iteration", yaxis_title="log loss + L2")
fig
def sweep_thresholds(y_true: np.ndarray, proba: np.ndarray, thresholds: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
kappas = np.empty_like(thresholds, dtype=float)
accs = np.empty_like(thresholds, dtype=float)
for i, t in enumerate(thresholds):
y_hat = (proba >= t).astype(int)
kappas[i] = cohen_kappa_score_numpy(y_true, y_hat)
accs[i] = float((y_true == y_hat).mean())
return kappas, accs
proba_val = predict_proba_logreg(X_val_s, w)
thresholds = np.linspace(0.01, 0.99, 99)
kappas, accs = sweep_thresholds(y_val, proba_val, thresholds)
best_idx = int(np.argmax(kappas))
best_t = float(thresholds[best_idx])
print(f"best threshold (val) = {best_t:.2f}")
print(f"best kappa (val) = {kappas[best_idx]:.3f}")
print(f"accuracy @ best_t = {accs[best_idx]:.3f}")
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=kappas, mode="lines", name="kappa"))
fig.add_trace(go.Scatter(x=thresholds, y=accs, mode="lines", name="accuracy", opacity=0.7))
fig.add_vline(x=best_t, line_dash="dash", line_color="black")
fig.update_layout(
title="Validation sweep: kappa and accuracy vs threshold",
xaxis_title="threshold",
yaxis_title="metric value",
)
fig
best threshold (val) = 0.35
best kappa (val) = 0.764
accuracy @ best_t = 0.926
def eval_at_threshold(y_true: np.ndarray, proba: np.ndarray, t: float) -> dict[str, float]:
y_hat = (proba >= t).astype(int)
cm, _ = confusion_matrix_numpy(y_true, y_hat, labels=np.array([0, 1]))
return {
"threshold": float(t),
"accuracy": float((y_true == y_hat).mean()),
"kappa": cohen_kappa_score_numpy(y_true, y_hat),
"tn": float(cm[0, 0]),
"fp": float(cm[0, 1]),
"fn": float(cm[1, 0]),
"tp": float(cm[1, 1]),
}
proba_test = predict_proba_logreg(X_test_s, w)
m_default = eval_at_threshold(y_test, proba_test, t=0.5)
m_best = eval_at_threshold(y_test, proba_test, t=best_t)
print("test @ t=0.50:", m_default)
print("test @ best_t:", m_best)
cm_default, _ = confusion_matrix_numpy(y_test, (proba_test >= 0.5).astype(int), labels=np.array([0, 1]))
cm_best, _ = confusion_matrix_numpy(y_test, (proba_test >= best_t).astype(int), labels=np.array([0, 1]))
fig = make_subplots(rows=1, cols=2, subplot_titles=["t = 0.50", f"t = {best_t:.2f}"])
fig.add_trace(go.Heatmap(z=cm_default, x=["0", "1"], y=["0", "1"], colorscale="Blues", showscale=False), row=1, col=1)
fig.add_trace(go.Heatmap(z=cm_best, x=["0", "1"], y=["0", "1"], colorscale="Blues", showscale=False), row=1, col=2)
fig.update_yaxes(autorange="reversed")
fig.update_layout(title="Test confusion matrices: default vs kappa-optimized threshold", xaxis_title="pred", yaxis_title="true")
fig
test @ t=0.50: {'threshold': 0.5, 'accuracy': 0.88, 'kappa': 0.5161290322580645, 'tn': 400.0, 'fp': 0.0, 'fn': 60.0, 'tp': 40.0}
test @ best_t: {'threshold': 0.35000000000000003, 'accuracy': 0.9, 'kappa': 0.6727748691099477, 'tn': 381.0, 'fp': 19.0, 'fn': 31.0, 'tp': 69.0}
7) Pros, cons, and good use cases#
Pros
Chance-corrected agreement: unlike accuracy, it explicitly subtracts the agreement expected from the label marginals alone
Works for binary and multiclass classification
Weighted variants handle ordinal labels naturally
Cons / caveats
Sensitive to class prevalence and prediction bias (the “prevalence paradox”): the same accuracy can yield different \(\kappa\)
Not defined for probabilistic outputs; you must pick a threshold (or argmax) first
Not differentiable → usually not a direct training loss (use it for model selection / threshold tuning)
Can be unstable on small datasets (marginals estimated with high variance)
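To make the small-sample caveat concrete, here is a quick simulation (the agreement rate and sample sizes are arbitrary choices): the same rater pair is resampled many times, and the spread of \(\kappa\) is compared across sample sizes.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def kappa_spread(n: int, reps: int = 300) -> float:
    """Std of kappa across repeated samples from a fixed rater pair (~80% agreement)."""
    vals = []
    for _ in range(reps):
        y = rng.integers(0, 2, size=n)
        flip = rng.random(n) < 0.2          # second rater disagrees 20% of the time
        y2 = np.where(flip, 1 - y, y)
        vals.append(cohen_kappa_score(y, y2))
    return float(np.std(vals))

print(f"std(kappa) at n=30:   {kappa_spread(30):.3f}")
print(f"std(kappa) at n=3000: {kappa_spread(3000):.3f}")
```

The underlying agreement process is identical in both cases; only the sample size changes, and the spread of the estimated \(\kappa\) shrinks roughly like \(1/\sqrt{n}\).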
Good use cases
Inter-annotator reliability (two humans labeling the same items)
Evaluating classifiers under class imbalance, where trivial prevalence-matching baselines already score high on accuracy
Ordinal classification (use weighted \(\kappa\))
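For the inter-annotator case, a minimal sketch (the two raters and their labels are invented): cohen_kappa_score accepts arbitrary label values, including strings, so annotation data can be scored directly.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Two hypothetical annotators labeling the same 12 items
rater_a = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "neu", "pos"]
rater_b = ["pos", "neg", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neg", "neu", "neu"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"inter-annotator kappa = {kappa:.3f}")
```

The raters agree on 9 of 12 items (\(p_o = 0.75\)), but after chance correction \(\kappa\) lands noticeably lower, indicating substantial but imperfect agreement.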
Exercises#
Construct a dataset where accuracy increases but \(\kappa\) decreases (hint: shift the marginals).
For the threshold-sweep demo, optimize for accuracy instead of \(\kappa\). How do the chosen thresholds differ?
Create your own custom weight matrix for ordinal labels (e.g., asymmetric penalties) and compute \(\kappa_w\).
References#
scikit-learn API: sklearn.metrics.cohen_kappa_score
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Weighted kappa: commonly used for ordinal agreement (e.g., clinical ratings).